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COPULA GAUSSIAN GRAPHICAL MODELS AND THEIR 
APPLICATION TO MODELING FUNCTIONAL DISABILITY 

DATA^ 

By Adrian Dobra and Alex Lenkoski 

University of Washington and Heidelberg University 

We propose a comprehensive Bayesian approach for graphical 
model determination in observational studies that can accommodate 
binary, ordinal or continuous variables simultaneously. Our new mod- 
els are called copula Gaussian graphical models (CGGMs) and embed 
graphical model selection inside a semiparametric Gaussian copula. 
The domain of applicability of our methods is very broad and encom- 
passes many studies from social science and economics. We illustrate 
the use of the copula Gaussian graphical models in the analysis of 
a 16-dimensional functional disability contingency table. 

1. Introduction. The determination of conditional independence rela- 
tionships through graphical models is a key component of the statistical 
analysis of observational studies. A pertinent example we will focus on in this 
paper is a functional disability data set extracted from the "analytic" data 
file for the National Long Term Care Survey (NLTCS) created by the Center 
of Demographic Studies at Duke University. Each observed variable is binary 
and corresponds to a measure of disability defined by an activity of daily 
living. This contingency table cross-classifies information on elderly aged 65 
and above pooled across four survey waves, 1982, 1984, 1989 and 1994 — see 
Manton, Corder and Stallard (1993) for more details. The 16 dimensions of 
this table correspond to six activities of daily living (ADLs) and ten instru- 
mental activities of daily living (lADLs). Specifically, the ADLs relate to 
hygiene and personal care: eating (ADLl), getting in/out of bed (ADL2), 
getting around inside (ADL3), dressing (ADL4), bathing (ADL5) and get- 
ting to the bathroom or using a toilet (ADL6). The lADLs relate to activities 
needed to live without dedicated professional care: doing heavy house work 
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(lADLl), doing light house work (IADL2), doing laundry (IADL3), cooking 
(IADL4), grocery shopping (IADL5), getting about outside (IADL6), trav- 
elling (IADL7), managing money (IADL8), taking medicine (IADL9) and 
telephoning (lADLlO). For each ADL/IADL measure, subjects were classi- 
fied as being either healthy (level 1) or disabled (level 2) on that measure. 
The methodology we develop in this paper allows us to determine the com- 
plex pattern of conditional associations that exist among the 16 daily living 
activities. This represents a critical issue that was left unexplored in previous 
analyses of this data set [Erosheva, Fienberg and Joutard (2007); Fienberg 
et al. (2010)]. 

In fact, the domain of applicability of our methods is not restricted to 
contingency tables. Since multivariate data sets arising from social science 
or economics typically contain variables of many types, our goal is to de- 
velop an approach to graphical model determination that is broad enough 
to be applicable to any study that involves a mixture of binary, ordinal and 
continuous variables. 

Most of the research efforts in the graphical models literature have been 
focused on multivariate normal models or on log-linear models — see, for 
example, the monographs of Lauritzen (1996) and Whittaker (1990). These 
models relate to data sets that contain exclusively continuous or categorical 
variables. CG distributions [Lauritzen (1996)] constitute the basis of a class 
of graphical models for mixed variables, but they impose an overly restrictive 
assumption: the conditional distribution of the continuous variables given 
the discrete variables must be multivariate normal. As such, the three main 
classes of graphical models are too restrictive to be widely applicable to 
social science or economics studies. 

Copulas [Nelsen (1999)] provide the theoretical framework in which mul- 
tivariate associations can be modeled separately from the univariate distri- 
butions of the observed variables. Genest and Neslehova (2007) advocate 
the use of copulas when modeling multivariate distributions involving dis- 
crete variables. In this paper we employ the Gaussian copula and further 
require conditional independence constraints on the inverse of its correlation 
matrix. The resulting models are called copula Gaussian graphical models 
(CGGMs) because they only impose a multivariate normal assumption for 
a set of latent variables which are in a one-to-one correspondence with the 
set of observed variables. A related approach for inference in Gaussian cop- 
ulas has been developed by Pitt, Chan and Kohn (2006). Their framework 
involves parametric models for Gaussian copulas and the univariate marginal 
distributions of the observed variables. We treat these marginal distributions 
as nuisance parameters and focus on the determination of graphical models. 

The structure of the paper is as follows. In Section 2 we formally introduce 
Gaussian graphical models (GGMs) and describe a Bayesian framework for 
inference in this class of models. In Section 3 we discuss modeling aspects 
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related to binary and ordinal variables. In Section 4 we show how to extend 
GGMs to represent conditional independence associations in a latent vari- 
ables space. We also present a Bayesian model averaging approach for graph 
identification and estimation in CGGMs. In Section 5 we analyze the NLTCS 
functional disability data together with another six-dimensional contingency 
table using CGGMs. We discuss our proposed methodology in Section 6. 

2. Gaussian graphical models. We let X = Xy, V = {1,2, . . . ,p}, be 
a random vector with a joint distribution p{Xv)- The conditional indepen- 
dence relationships among {X^ : v £ V} under p{Xv) can be summarized in 
a graph G = {V,E), where each vertex v €V corresponds with a random 
variable X^ and E C V x V are undirected edges [Whittaker (1990)]. Here 
"undirected" means that {vi,V2) £ E is equivalent with {v2,vi) £ E. 

The absence of an edge between X^-^ and X^^ corresponds with the con- 
ditional independence of these two random variables given the remaining 
variables under p{Xv) and is denoted by 

(2.1) Xy^ _IL Xi,2 I Xy\[jj^^y^j. 

This is called the pairwise Markov property relative to G, which in turn 
implies the local as well as the global Markov properties relative to G [Lau- 
ritzen (1996)]. 

We denote by Gy the set of all 2^*^^"^^/^ undirected graphs with vertices V. 
Since Gy contains many graphs even for relatively small values of p, it can- 
not be enumerated and has to be visited using stochastic search methods 
[Madigan and York (1995); Jones et al. (2005); Lenkoski and Dobra (2010)]. 
Such algorithms move through Gv using neighborhood sets nbd(G) C Gv for 
G G Gv- The neighborhood of a graph G € Gv is comprised of all the graphs 
obtained from G by adding or deleting one edge. These neighborhood sets 
are symmetric and link any two graphs through a path of graphs such that 
two consecutive graphs on this path are neighbors of each other. We remark 
that the neighborhood sets associated with Gv contain the same number of 
graphs p{p — l)/2. 

Furthermore, we assume that X = Xy follows a p-dimensional multivariate 
normal distribution Np(0, K~^) with precision matrix K = (-ft't;i,u2)i<fi,f2<p- 
We let x^^'") = {x^^\ . . . ,x("^)'^ be the observed data of n independent sam- 
ples of X . The likelihood function is proportional to 

(2.2) p(x(^^")|K)oc(detif)"/2exp{-l(if,[/)}, 

where U = ^'^^^x^^^x^^^'^ , and {A,B) =tT{A^B) denotes the trace inner 
product. We assume that the data have been centered and scaled, so that 
the sample mean of each X^ is zero and its sample variance is one. 

A graphical model G = (V, E) for Np{{), K~^) is called a Gaussian graphi- 
cal model (GGM) and is constructed by constraining some of the off-diagonal 
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elements K to zero. For example, the pairwise Markov property (2.1) holds 
if and only if K^-^^^,^ = 0. This implies that the edges of G correspond with the 
off-diagonal nonzero elements oi K, that is, E = {{vi,V2) \ K^^^^^ 7^ 0,fi / V2}- 
Given G, the precision matrix K is constrained to the cone Pq of symmetric 
positive definite matrices with entries K^-^^^^ equal to zero for all (vi, 'L'2) ^ E, 

We consider a G-Wishart prior Wg'(5, -D) for K with density 

(2.3) v{K\G) = j-^^ (det Kf-'^ expj - i {K, D) 

with respect to the Lebesgue measure on Pq [Roverato (2002); Atay-Kayis 
and Massam (2005); Letac and Massam (2007)]. The normalizing cons- 
tant Ici^, D) is finite provided 6 > 2 and D is positive definite [Diaconis and 
Ylvisaker (1979)]. If G is the complete graph with p vertices (i.e., there are 
no missing edges), Wc{S,D) reduces to the Wishart distribution Wp{6,D), 
hence, its normalizing constant is 

(2.4) Ig{6, D) = 2^^^P~^^P''^Tp{{5 +p- l)/2}(det 2))-('5+p-i)/2^ 

where Fpia) = tt^^p-^)/^ IlCo r(a - f) for a>{p- l)/2 [Muirhead (2005)]. 
If G is decomposable, Ig{S,D) is explicitly calculated [Roverato (2002)]. 
For nondecomposable graphs, the Monte Carlo method of Atay-Kayis and 
Massam (2005) can be used to numerically approximate Ig{S,D) in a fast 
and accurate manner. 

Throughout this paper we set the prior parameters for K to 5 = 3 and 
D = Ip, the p-dimensional identity matrix. From equations (2.2) and (2.3) 
we see that the interpretation of this prior is that the components of X are 
independent apriori and that the "weight" of the prior is equivalent to one 
observed sample. 

The G-Wishart prior is conjugate to the likelihood (2.2), thus, the poste- 
rior distribution of K given G is Wg((5 + n,D + U), that is. 

Given K S Pq, the regression of X^ on the remaining elements of X depends 
only on the neighbors of v in G: 



(2.5) p{X^\Xv\{i,}=xv\{v},K)=^i- ^ ^^P^x„> 



v'Gbdciv) 



K K 



where hdciv) = {v' £V : {v, v') G E}. 

The Cholesky decomposition of a matrix K G Pq is K = cf^cp, where (f) 
is an upper triangular matrix with (py^^ > 0, v £ V . Roverato (2002) proved 
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that the set i^(G') of the free elements of (j) consists of the diagonal elements 
together with the elements that correspond with the edges of G, that is, 

u{G) = {{vi,vi) :fi G y} U {(^1,^2) ■.vi<V2 and {vi,V2) G E}. 

Once the free elements of (p are known, the remaining elements are also 
known. More specifically, we have (j)i^v2 = if ^ 2 and (1,^2) ^ E. We also 
have 

for 2 < vi < V2 and (^1,1^2) ^ E. The determination of the elements of (j) 
that are not free based on the elements of (j) that are free is called the 
completion of <j) with respect to G [Roverato (2002); Atay-Kayis and Massam 
(2005)]. It is useful to remark that the free elements of (j) fully determine 
the matrix K. The development of our framework involves the Jacobian of 
the transformation that maps K G Pq to the free elements of (j) [Roverato 
(2002)]: 

where is the number of elements in bdc^v) H {v + I, . . . ,p}. 

3. Incorporating binary and ordinal categorical variables. A variable Xy 
that takes a finite number of ordinal values {1, 2, . . . , d^,}, with > 2, 
is incorporated in our modeling framework by introducing a continuous 
latent variable Zy underlying Xy — see, for example, Muthen (1984). We 

denote by {x^\ . . . ,x[r^} the observed samples associated with Xy. The 
samples from Zy are denoted by {zi'^\ . . . , zi""^}. Typically the relation- 
ship between Xy and its surrogate Zy is expressed through some thresholds 
Ty = (t„,o, t„,i , . . . , Ty^^iy,, ) with -00 = Tyfi < r^,i < • • • < r^,,^^ = 00. Formally, 
we set [Dunson (2006)] 

(3.1) ^^■^ = E^xW.<4%.,a' ^• = i'2'-'- 
1=1 

This model is identifiable if the value of r„^i is fixed at a certain value. We 
follow an idea originally suggested by Hoff (2007) that does not explicitly 
involve the thresholds Ty. This approach is based on the remark that the 
relationship between the observed and latent samples satisfies the constraints 

o^ T-O'i) ^ ^(32) Jji) ^ Ah) Jji) ^ Sh) Ah) < ^{h) 

yO.^ j "17 1) — V ~ 1) ' 1) " V — V V 
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for 1 < ji 7^ j2 ^ n. We see that if X^, and are related as in (3.1), then (3.2) 
holds. If (3.2) holds, then (3.1) also holds by choosing t^^i = max{zi'^^ : Xy^ = 
for / = 1, . . . , — 1. It follows that, given the observed data x^^'") , the la- 
tent samples z^^-") = [z^^\z^'^\ . . . , z^")) are constrained to belong to the set 

^(x(^^")) = G R"^P:Li(z(i^")) < < U^iz^^'''^)}, 

where 

Li(z(^^")) = ma^{zi^^ : xi^^ < 4^')}, 

(3.3) 

[/^■(z(^^")) = min{4'^) : xi^^ < x^^^}. 

If the value x^^^ is missing from the observed data, we define 4 (z(i^")) = — oo 
and C/i?'(z(i^")) = 00. 

4. Copula Gaussian graphical models. We assume that an observed vari- 
able can be binary, categorical with ordered categories, count or contin- 
uous. We denote by F„ the univariate distribution of X^ and by the 
pseudo-inverse of F^. Given a precision matrix K, we model the joint dis- 
tribution of X = Xy as follows [see also Hoff (2007)]: 

Zy~^p(0,K"l), 

(4.1) z, = zj{K~')l(l veV, 

X, = F-\^Z,)), v£V. 

In (4.1) the joint distribution of the latent variables is multivariate normal 
Z = Zy ~ iVp(0, T(i^)), where T(i^) is a correlation matrix with entries 



(4.2) T,„,,(K) 



(-^ "'^)di,D2 



The joint distribution F oi X = Xy is subsequently a function of the corre- 
lation matrix T{K) and the univariate distributions F^ of X^: 

p{Xi <xi,...,Xp<Xp) = F(xi, . . . ,Xp\T{K),Fi, . . .,Fp), 

= C{Fiixi),...,Fp{xp)\T{K)), 

where 

(4.3) Cim, . . .,Up\T') = <^p{<^~\ui), <^~\up)\T') : [0, 1]^ ^ [0, 1] 

is the Gaussian copula with p x p correlation matrix T' [Nelsen (1999)]. 
Here $(•) represents the CDF of the standard normal distribution and 
$p(-|T) is the CDF of iVp(0,T). 



COPULA GAUSSIAN GRAPHICAL MODELS 



7 



We avoid the need to formally make assumptions regarding the parametric 
representation of {Fi,:v G V}, which could be a daunting task for most 
real world data sets, by treating their marginal distributions as nuisance 
parameters. Moreover, we reduce our model parameters to the correlation 
matrix of the Gaussian copula (4.3). This means that we focus on the joint 
distribution of the latent variables Zy whose relationships with the observed 
variables Xy are given by (4.1). Since F~^{-) and <!>(•) are nondecreasing, 
(4.1) implies (3.2) which does not depend on the marginal distributions 
{F^ :v € V}. The converse is also true: if the relationship (3.2) between the 
observed and latent samples holds, then (4.1) also holds by replacing F^ 
with the empirical distribution of X^. 

As suggested by Hoff (2007), inference in the latent variables space can 
be performed by substituting the observed data x^^'"^ with the event V = 
^^{i:n) g ^(x^^'"^)}. We write the likelihood function as 

p(x(^^")|K, {F^:ve V}) = p{V\K)p{x^^--'''^\V, T{K), {F^ive V}). 

In this decomposition p{'D\K) is the only part of the observed data likeli- 
hood that is relevant for making inference on K. Furthermore, p{T>\K) does 
not depend on {F^-.v € V}. Hoff (2007) calls p{T>\K) the extended rank 
likelihood and constructs a Gibbs sampler with stationary distribution 

(4.4) p{K\V)(xp{V\K)p{K), 

where K follows a Wishart prior distribution Wp{5,D). 

We are interested in modeling the conditional independence relationships 
among the latent variables Z = Zy using Gaussian graphical models. We 
go one step further compared to Hoff (2007) and impose zero constraints in 
the precision matrix K according to a graph G. We refer to the graphical 
models constructed in the latent space as copula Gaussian graphical models 
(CGGMs). The inference approach described in Hoff (2007) is equivalent to 
reducing the set of candidate graphs to only one graph. This graph is the 
full graph in which all the edges are present and none of the off-diagonal 
elements of K are constrained to zero. 

The Markov properties associated with a CGGM are guaranteed to trans- 
late into Markov properties for the observed variables if all the marginals 
\F^ : V € V} are continuous [Liu, Lafferty and Wasserman (2009)]. The pres- 
ence of some discrete observed variables might induce additional dependen- 
cies among the X's that are not modeled in a CGGM, but such dependen- 
cies can be regarded as having a secondary relevance since they emerge from 
the marginals {F„ :v S V}. The conditional independence graphs for the la- 
tent variables could contain edges then that do not necessarily correspond 
with conditional independence relationships in the observed variables space. 
Conversely, there might exist conditional independence relationships among 
the observed variables that are not represented in conditional independence 
graphs that involve latent variables. 
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4.1. Bayesian inference in copula Gaussian graphical models. Let G G Qy 
be a graph defining a CGGM. Tlie joint posterior distribution of K € Pq 
and the graph G is given by 

(4.5) p{K, G\V) oc p{V\K)p{K\G)p{G). 

The prior distribution of K conditional on G is G-Wishart Wg{S,D) and 
the prior distribution over Qv is uniform, that is, p{G) oc 1. Other choices 
of priors on the graphs space Gy take into consideration the imphed distri- 
bution on the number of edges [Wong, Carter and Kohn (2003)], encourage 
sparsity [Jones et al. (2005)] or have multiple testing correction properties 
[Scott and Berger (2006)]. 

We describe a Markov chain Monte Carlo sampler for the joint distribu- 
tion (4.5). We consider two strictly positive precision parameters dp and ag 
that remain fixed throughout at some small values, for example, crp = ag = 
0.1. Given the current state of the chain {K^ ,G^), its next state (K**"*"^, G**"*"^) 
is generated by sequentially performing the following updates. 

Step 1: Resample the latent data. For each v £V and j E {1,2, . . . ,n}, we 

update the latent value zi, by sampling from its full conditional distribution. 
The distribution of conditional on Zy\^^j = Zv\{v} ^(Pv,(^v) truncated 

to the interval [Li,Ui], where /i„ = -Y^v'^hdaiv) lif^^v' ^^^^ = — 

see (2.5). The bounds Li and are given in (3.3). The new value of 4^') is 
obtained by sampling from this truncated normal distribution. 

Step 2: Resample the precision matrix. We sequentially perturb the free el- 
ements {(f>l_^ ^2 • (^^ii ■'^2) S ^{G'^)} ill the Cholesky decomposition K"^ = {(f)'^)'^ cf' 
around their current value. Here (p^ is upper triangular. We perform a Metro- 
polis-Hastings update of associated with a diagonal element c/)^^ > 
by sampling a value 7 from a N{(j)^_^^^_^ , Up) distribution truncated below at 0, 
that is, 

We take K' = {(j)')'^ (f)' , where (f)' is such that its free elements coincide with 
the free elements of 0** , with the exception of the {vi , vi ) element which is 
set to 7. The elements of (j)' that are not free are obtained by the completion 
operation described in Section 2. The acceptance probability of the update 
of -fC* to K' is min{i?p, 1}, where 

^ p{K'\z^^--^),G') J{K'^<t)') g(<Ag„.j7) 

V p(K^|^(l:n),G^) J(K^ ^ <^^) g(7|0^^^,J ' 
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Here we denote 

i?; = exp I - i ^i^' - I) + J2 ^^'^^^'^^ 

Next we consider a free ofF-diagonal element (p^^ , where vi < f 2 and 
{vi,V2) G t^{G^)- We sample a candidate value 7' from a N{(j)^,^ ,^^,ap) dis- 
tribution. As before, we take K' = [cf)')'^ cf)' , where cf)' and cf)^ have the same 
free elements with the exception of the ('Ui,f2) element that has (t>'y^^^ =l' ■ 
The remaining nonfree elements of (f)' are obtained through completion. Due 
to the symmetry of the proposal distribution and the fact that deti^* = 
n^=i('/'S t))^ — \^v=ii't^'vvY = det i^T', the candidate matrix K' is accepted 
with probability min{i?p,l}. 

Since i^* € Pq" , the candidate matrix K' associated with each free ele- 
ment in I'iG'^) must also belong to . The precision matrix that is obtained 
after performing all the Metropolis-Hastings updates is iT^+^Z^ g Pqs . 

Step 3: Resample the graph. We consider the Cholesky decomposition 
^s+1/2 _ ^^s+i/2-jT^s+i/2 ^j^gj-g (f)^^^/'^ is upper triangular. We randomly 
choose a pair {vi,V2), vi <V2. If there is no edge between vi and V2 in G*, 
that is, (^1,^2) ^ ^{G^), we add this edge to to obtain a candidate 
graph G'. This implies bdcivi) = hdcivi) U {^2}, hence, (i^^ = df^" + 1. 
Moreover, i^(G') = U {(vi, ^2)}. We define an upper diagonal matrix (j)' 

such that <j)'^, ^, = (/"^^^/^ for all (f^, ^2) G v{G'^). The value of </'^^^^2 

sampling from a A^((/)^^i2^, cXg) distribution. The remaining elements of (f)' 
are determined through completion with respect to the graph G' . We see 
that (f)' has one additional free element with respect to <f)^^^^'^ whose value 
was randomly chosen by perturbing the nonfree (^1,^2) element of (^*+^/^. 

We take the candidate precision matrix K' = {4>')^(l)' £ Pq' ■ Since the 
dimensionality of the parameter space increases by one, we must make 
use of the reversible jump Markov chains methodology proposed by Green 
(1995). We accept the update of {K'+^I^,G') to {K\G') with probability 
min{i?g, 1}, where Rg is given by 

p{z^^--n)\K>)p[K'\G') |nbd(G'')| 
p{zi^--^)\K'+y^)p{K^+^/^\G') |nbd(G')| 

J{K' ^ 00 J(0^+l/2 ^ 

^ J(i^^+V2 ^0.4-1/2) (l/(^^V2^))g,p(_(^,^^^^ _ ^^+1/2)2/(2^2))- 

We denote by |-B| the number of elements of a set B. All the graphs in Qv 
have the same number of neighbors, hence, |nbd(G*)| = |nbd(G')| = p{p — 
l)/2. Since the free elements of (j)' are the free elements of (f)^'^^/'^ and (j)'^^ 
the Jacobian of the transformation from (f>^^^/'^ to (j)' is equal to 1, that is, 




10 



A. DOBRA AND A. LENKOSKI 



J{(j)'^~^^^'^ (j)') = 1. Moreover, c/)**"^^/^ and (j)' have the same elements on the 
main diagonal and are upper triangular, therefore, detK'^'^^^'^ = detiT'. We 
also have 



J{K' 



^1 .^1 > _ 



It follows that Rg is equal to 



^^'''^ Ig'{6,D) 



Now we examine the case when there is an edge between vi and V2 
in . We delete this edge from to obtain a candidate graph G' . We 
have bdc'ivi) = hdc'ivi) \ {■U2}, hence, d^/ = d'^^ - 1 and v{G') = i^{G^) \ 

{(vi,V2)}- We define an upper diagonal matrix cp' such that (/)' , , = (p^t^C^ 

for all {v[,V2) G ^{G')- The {vi,V2) element is free in cf)^^^/"^, but it is no 
longer free in cj)' . The nonfree elements of (j)' are obtained by completion with 
respect to the graph G' . As before, we take K' = {(f)')'^ cf)' S Pqi . The dimen- 
sionality of the parameter space decreases by 1 as we move from (fy^'^^/'^ to 0'. 
We obtain that the acceptance probability of the update from (i<'*"^^/^,G*) 
to {K' ,G') is min{i?g, 1}, where R'g is equal to 

X exp I - i ^ A" - V2 ^ ^ + j-^ ,U) ,{m ^ _ -^t,t'vi'? I _ 

The updated graph and the corresponding precision matrix that are obtained 
at the end of this step are G^~^^ and AT*"*"^, respectively. 

We note that our strategy for updating the precision matrix and the graph 
has some similarities with the work of Giudici and Green (1999). However, 
they focused exclusively on decomposable graphs and perturbed elements of 
the covariance matrix that are either on its main diagonal or correspond 
to an edge in the graph. 

4.2. Estimation and testing in copula Gaussian graphical models. In high- 
dimensional data sets with a small number of observed samples it is likely 
that the highest posterior probability graph receives only a small (almost 
zero) posterior probability. Furthermore, changing a few edges in this graph 
could lead to graphs with comparable posterior probabilities. When model 
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uncertainty is high, Bayesian model averaging becomes key because it avoids 
the need to perform inference by making an exphcit choice about which edges 
are present or absent in the graphs that underhe the CGGMs. This choice 
is not desirable since a small sample size means lack of sufficient informa- 
tion. As such, averaging over a large number of graphs is preferable even if 
prediction is not the final goal. 

We let {{G'^,K'^,T'^):s = 1,2,...,S'} be samples from the joint distri- 
bution (4.5), where T'* is the correlation matrix corresponding with — 
see (4.2). These samples can be used to produce Monte Carlo estimates of 
functions involving the latent variables Z or the observed variables X. The 
posterior probability that two latent variables Z^-^ and Z^^ are not condi- 
tionally independent given Zy^^^^^^^y is the posterior inclusion probability 
of the edge (^1,^2) which is estimated as the proportion of graphs that 
contain the edge {vi,V2)- 

Thej)osterior expectation of the correlation matrix T is estimated by the 
mean T = ^ X]f=i '^^^ ^ ^^^o element of the correlation matrix T implies the 
independence of and Z^^ , which in turn implies the independence of Xy-^ 
and . We can conduct a Bayesian test of independence of Xy^ and Xy^ 
by considering the interval null hypothesis Hq^^^ : {Ty^^^y^l < e with the al- 
ternative H^^Y^ : \ Tyj^^y^ \ > e, where e > 0. Given equal apriori probabilities 
of the null and alternative hypotheses, the Bayes factor 

is estimated as the number of T^^ y^ whose absolute value is above e divided 
by the number of T*^ y^ whose absolute value is below e. 
The CDF of X = Xy is estimated as 

1 ^ ^ 

-Y,C{F^ix,),...,Fp{xp)\T% 

s=l 

where Fy is the empirical univariate distribution of Xy . If each observed vari- 
able is discrete and takes values {0, 1,2,.. .}, their joint probability given T 
is [Song (2000)] 

1 1 

(4.6) p{Xv = xv\r) = Y,---Yl {-iy'^-''''C{u{'{xi), . . . ,<(xp)|T), 

ii=o jp=o 

where Uy{xy) = Fy{xy) and u\,{xy) = Fy{xy — 1). We define ^^(0) = 0. For 
example, if Xy £ {0, 1} is a binary random variable, we have Uy{l) = 1 and 
Uy{0) = ul{l) = i 6, • Here 63 is 1 if i? is true and is otherwise. 

Thus, the posterior expectation of the joint probability of Xy is estimated as 

1 ^ 

p{Xv =vy) = - Y,p{Xv = xv|T^). 
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Cramer's V [Cramer (1946)] is a measure of association between two cat- 
egorical variables X^^ and X^^ that take values in the finite sets X^^ and X^^ , 
respectively, 

1 

^^^'''^-min{|X,J,|X,,|}-l 
(4.7) ^ 

Cramer's V always takes values between and 1, but we have Pvi,v2 = if and 
only if X^^ and X^,^ are independent. The posterior expectation of Pvi,v2 is es- 
timated by calculating the marginal cell value p{Xy-^ = x^^^X^^ = Xy^\T^) of 
p{Xy = xy|T*) for s = 1, 2, . . . , 5", calculating from (4.7) with respect 

to p{Xy^ =Xi,-^,Xy^ = Xy^\T^) {ov s = 1,2,...,5, then taking the average 

We can test the independence of X^^ and X^^ based on Cramer's V as 
follows. We consider the null hypothesis Hq^^^^ ■ Pvi,v2 < ^ against the al- 
ternative H\^p^ : Pvi,v2 ^ ^- The corresponding Bayes factor in favor of the 
alternative hypothesis is 

= p(/?|'^i^''2|x(^^"))/p(F^/2|x(^^")), 

where we assumed equal apriori probabilities of -ffg ^Tp^ ■ esti- 

mate Bp^'^^ as the number of /?^^ above e divided by the number of 
below £. We note that Dunson and Xing (2009) have also used Cramer's V 
to perform Bayesian testing for multivariate categorical data in a nonpara- 
metric framework. 

In the two examples discussed in Section 5 we chose to test independence 
of each pair of variables based on Cramer's V since this measure takes into 
account the univariate distributions of the observed variables. 



5. Examples. In this section we apply copula GGMs to analyze two mul- 
tivariate data sets with high relevance in the social science literature. In the 
supplementary material [Dobra and Lenkoski (2010)] we provide C-I-+ code 
and the data sets that are needed to replicate the numerical results that 
follow. 



5.1. The Rochdale data. We consider a social survey data set previously 
analyzed in Whittaker (1990) — see Table 1. This observational study was 
conducted in Rochdale and attempted to assess the relationships among 
factors affecting women's economic activity. The eight variables are as fol- 
lows: a, wife economically active (no, yes); b, age of wife > 38 (no, yes); 
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Table 1 

Rochdale data from Whittaker (1990). The cells counts appear row by row in 
lexicographical order with variable h varying fastest and variable a varying slowest. The 

grand total of this table is 665 
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c, 


husband unemployed ( 


no, 


yes); 


d, 


child 


< 4 (: 


no, 


yes); 


e, 


wife's 


educa- 



tion, high-school+ (no, yes); /, husband's education, high-school+ (no, yes); 
g, Asian origin (no, yes); h, other household member working (no, yes). The 
resulting 2^ cross-classification has 165 counts of zero, while 217 cells con- 
tain small positive counts smaller than 3. There are quite a few counts larger 
than 30 or even 50. 

Since the sample size is only 665, this table is sparse. Whittaker (1990) ar- 
gues that higher-order interactions involving more than two variables should 
not be included in any log-linear model that is fit to this data set. He sub- 
sequently studies two log-linear models: the all two-way interaction model 
whose minimal sufficient statistics are all the 28 two-way marginals and the 
model whose minimal sufficient statistics are the two-way marginals corre- 
sponding with the pairs of variables 

(5.1) {fg, ef, dh, dg, eg, cf, ce, bh, be, bd, ag, ae, ad, ac}. 

We ran the Markov chain Monte Carlo sampler from Section 4.1 for 
250,000 iterations from 100 random starting graphs. The burn-in time was 
25,000 iterations. Convergence to the stationary distribution (4.5) is illus- 
trated in Figure 1 that gives the posterior expected number of edges in the 
CGGM graphs across iterations for each chain. The sampled graphs have on 
average 16.5 edges which represent approximately 59% of the total number 
of possible edges. By comparison, the log-linear model (5.1) has 14 minimal 
sufficient statistics. 
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Log Iteration 

Fig. 1. Estimates of the posterior expected number of edges in the CGGMs for the 
Rochdale data. 

In order to show the importance of modehng the conditional independence 
relationships among the latent variables using graphs, we have also employed 
the copula estimation approach proposed by Hoff (2007) — see equation (4.4). 
Hoff's method is equivalent to starting the Markov chain from Section 4.1 
at the full graph and never updating this graph by skipping step 3 of the 
algorithm. Moreover, updating the precision matrix from step 2 is performed 
by direct sampling from the Wishart posterior Wp{5 + n,D + Yl'^=i z^^^z^^^'^). 
This simplified Markov chain was run for 25 million iterations and henceforth 
is called the Copula-Full model. 

We compare the expected cell counts of the all two-way interaction log- 
linear model, the log-linear model (5.1), the Copula- Full model and the 
CGGMs. Table 2 shows the cells containing the 20 largest observed counts 
together with their corresponding estimates. It is remarkable that the CG- 
GMs perform as well as the all two-way interaction model for the largest 
cell count 57. The squared errors between the observed counts and the ex- 
pected cell counts for all the 256 cells in the table are the following: 284.79 
for the all two-way interaction model, 407.04 for the CGGMs, 905.78 for the 
model (5.1) and 1919.15 for the Copula-Full model. 

In Table 3 we show the pairwise correlations T„^ .t,2 and the posterior inclu- 
sion probabilities of edges {vi,V2) for any two latent variables and Z^^ 
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Table 2 

Expected cell counts for the top 20 largest counts cells associated with the all two-way 
interaction log-linear model, Whittaker's log-Unear model (5.1), the Copula-Full model 
and the CGGMs in the Rochdale data. Here 1 stands for no and 2 stands for yes 
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as estimated using the CGGMs. In Table 4 we give the estimates of the 
pairwise correlations T^^^^j obtained using the Copula- Full model. We see 
that the absolute values of these estimates are significantly smaller than 
corresponding absolute values of the CGGMs estimates. We show the de- 
pendence structure of the observed variables in Tables 5 and 6. We give 

Table 3 

Estimated correlations (elements under the main diagonal) and posterior inclusion 
probabilities of edges (elements above the mam diagonal) associated with the CGGMs in 

the Rochdale data 
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Table 4 

Estimated correlations (elements under the main diagonal) associated with the 
Copula-Full model in the Rochdale data 
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the posterior means of Cramer's V Pvi,v2 estimates of the posterior 
probabihties p{Hl^p^'^\x^^'^^) with H^^^^ : Pvi,v2 > 0.1. By contrasting the es- 
timates obtained using CGGMs and the Copula-Full model, we clearly see 
that conditioning on the full graph is quite disadvantageous: the Cramer's V 
associations are severely underestimated and, subsequently, all the posterior 
probabilities p{H^^^^^\x^^''"'^) are almost zero under the full graph. The CG- 
GMs take every possible graph into account and the corresponding estimates 
are produced by Bayesian model averaging across all graphs. This leads to 
more appropriate results as evidenced in Tables 3-6. 

Whittaker (1990), page 282, argues that the strongest pairwise interac- 
tion in the Rochdale data is {b,d), followed by {b,h), (e,/) and {a,g). In 
Table 3 we see that the top four posterior inclusion probabilities in the CG- 
GMs are as follows: 1 for {b,d), 0.96 for 0.98 for (e,/) and 1 for (0,5). 
The strongest associations in the observed variables space as measured by 

Table 5 

Estimated Cramer's V associations (elements under the main diagonal) and posterior 
probabilities p{Hi^p\x^^'"^) (elements above the main diagonal) associated with the 
CGGMs in the Rochdale data 
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Table 6 

Estimated Cramer's V associations (elements under the main diagonal) and posterior 
probabilities p{Hi^p\x^^''"^) (elements above the main diagonal) associated with the 
Copula-Full model in the Rochdale data 
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Cramer's V are the following: {b,h), {a,g), (e,/) and {c,g). The in- 

teraction between c and g is also present in the log-linear model (5.1). 

Of particular interest is the determination of the factors that influence 
variable a — the wife's economic activity. From Table 5 we see that vari- 
ables c, d and g are the only variables with a strictly positive posterior 
probability that their Cramer's V association with variable a is greater 
than 0.1. The largest Cramer's V association is pa,g = 0.12, followed by 
Pa,c = 0.08 and pa,d = 0.08. The corresponding estimated correlations from 
Table 3 show a negative relationship between a and each of these three vari- 
ables. Whittaker (1990) determines which variables influence a by consider- 
ing the log-linear model ac\ad\ae\ag induced by the generators of model (5.1) 
that involve a. Using maximum likelihood estimation of log-linear parame- 
ters, Whittaker obtains the following estimates of the logistic regression of 
a on c, d, e and g: 

p(cL = lie, d, e, q) 

(5.2 log ; = const. - 1.33c - 1.32d + 0.69e - 2.17^. 

p{a = 0\c, d,e,g) 

Equation (5.2) seems to support our findings based on CGGMs, as it in- 
dicates a negative association between (a, c), (a,d), {a,g), and a positive 
association between (a,e). Moreover, the association between a and e is the 
weakest of the four. The CGGMs estimate pa,e = 0.04 which is about half 
of 

Pa,c or Pa,d- The absolute values of the regression coefficients in (5.2) share 
the same pattern. 

We remark that Table 3 reports a posterior inclusion probability equal 
to 0.93 for the edge (a, 6). However, the CGGMs estimate the pairwise cor- 
relation '$a,b to be 0.15 and the Cramer's V association pa^b to be 0.01. 
Therefore, the CGGMs do not seem to indicate a relevant interaction be- 
tween variables a and b which is in line with Whittaker's findings who did 
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not include an interaction term ab in model (5.1). This represents an exam- 
ple where an edge vanishes as we move from the latent variables space to 
the observed variables space. We would expect the opposite to happen in 
most applications, that is, edges or associations could be lost when moving 
from the observed to the latent variables. 

5.2. The NLTCS functional disability data. We come back to the 2^^ 
functional disability table introduced in Section 1. Dobra, Erosheva and 
Fienberg (2003) analyze these data from a disclosure limitation perspective, 
while Fienberg et al. (2010) develop latent class (LC) models that are very 
similar to the Grade of Membership (GoM) models of Erosheva, Fienberg 
and Joutard (2007). The need to consider alternatives to log- linear models 
for the NLTCS data comes from the severe imbalance that exists among the 
cell counts in this table. The largest cell count is 3853, but most of the cells 
(62,384 or 95.19%) contain counts of zero, while 1729 (2.64%) contain counts 
of 1 and 1499 (0.76%) contain counts of 2. There are 24 cells with counts 
larger than 100, which accounts for 42% of the observed sample size 21,574. 
This gives a very small mean number of observations per cell of 0.33, which 
is indicative of an extremely high degree of sparsity that is characteristic of 
high-dimensional categorical data. 

We ran 100 replicates of the Markov chain Monte Carlo sampler from 
Section 4.1 for 500,000 iterations with a burn-in time of 50,000 iterations. 
Figure 2 shows the convergence of these Markov chains to the joint distribu- 
tion (4.5). The mean number of edges of the sampled graphs is 72 or 60% of 
the total number of edges. Table 7 compares the expected cell values of the 
six largest counts as estimated with the Grade of Membership (GoM) model 
of Erosheva, Fienberg and Joutard (2007), the latent class (LC) model of 
Fienberg et al. (2010) and the CGGMs. All three models seem to perform 

Table 7 

Expected cell counts for the top six largest counts cells in the NLTCS data. We report the 
results obtained from the GoM model [Erosheva, Fienberg and Joutard (2007)], the LC 
model [Fienberg et al. (2010)] and the CGCMs. Here 1 stands for healthy and 2 stands 

for disabled 
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Fig. 2. Estimates of the posterior expected number of edges m the CGGMs for the NLTCS 
functional disability data. 

comparably well in terms of capturing the underlying dependency patterns 
that lead to the largest counts in this 2^^ table. 

In Table 8 we show the association structure of the latent variables Z. 
We give posterior estimates of the pairwise correlations Ty-^^^^^ and posterior 
inclusion probabilities for each edge {vi,V2)- AH the estimates of the pair- 
wise correlations are quite large and strictly positive, which is intuitively 
correct: the ability to perform any activity of daily living is positively cor- 
related with the ability to perform any other activity. In Table 9 we show 
the association structure of the observed variables X. For every pair X^^ 
and X„2, we give the posterior means of Pvi,v2 a^id estimates of the poste- 
rior probabilities p{Hl^p''^\x^^--''^) with Hl]p:py,^y^ > 0.1. The Cramer's V 
values indicate that independence is unlikely to hold for any pair of ob- 
served variables, which is consistent with the large positive correlations we 
estimated in the latent space. In fact, 88 pairs of observed variables have 
a Bayes factor Bp^'^^ greater than 100, which constitutes strong evidence in 
favor of the hypothesis H'^^^'"^ [Kass and Raftery (1995)]. Thus, the NLTCS 
data shows that approximately 73% pairs of ADLs and lADLs are certainly 
not independent of each other. 

The topology of the sampled graphs is indicative of the relative impor- 
tance of each disability measure with respect to the others in the latent vari- 
ables space. The structure of a graph can be summarized by the number of 
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Table 8 

Estimated correlations (elements under the main diagonal) and posterior inclusion probabilities of edges (elements above the main 

diagonal) in the NLTCS data 
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Table 9 

Estimated Cramer's V associations (elements under the main diagonal) and posterior probabilities p{Hi,p\x^^'"^) (elements above the 

main diagonal) in the NLTCS data 
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Fig. 3. Cumulative Cramer's V associations (x-axis) and posterior expected degrees 
(y-axis) of the 16 disability measures from the NLTCS functional disability data. 

neighbors of each vertex, that is, the number of edges that involve each vari- 
able. This is usually called the degree of a vertex. A larger degree indicates 
an increased number of interactions in which a latent variable participates. 
Since in the NLTCS data all the latent variables are positively associated 
with each other, having one disability increases the likelihood of having other 
disabilities. The degree of a variable reflects the number of disabilities that 
are not conditionally independent of this variable given the others. 

In the observed variables space we quantify the relative importance of 
a variable as the sum of the Cramer's V associations Pv-i,v2 between 
and some other variable X^^ . When computing these cumulative Cramer's V 
associations we assume that the 120 — 88 = 32 pairwise associations with 
a Bayes factor below 100 are set to zero. Figure 3 shows the posterior ex- 
pected degrees of the 16 disability measures plotted against the correspond- 
ing cumulative Cramer's V associations. We see that IADL4 (cooking) and 
lADLlO (telephoning) stand out in the latent space. Most individuals in- 
cluded in the survey (67.6%) are unable to cook, hence, there is no surprise 
that IADL4 is the second most connected variable. However, only a rela- 
tively small number of people (10.6%) cannot use the telephone on their 
own. In fact, more people are disabled with respect to any of the other 15 
measures. As such, it might be counterintuitive to see that lADLlO has the 
highest degree of connectivity. In the observed variables space the top three 
cumulative Cramer's V associations are obtained for lADLl, IADL2 and 
IADL3. We note that lADLl (doing heavy house work) and IADL2 (doing 
light house work) are nested, hence, we would expect their association scores 
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to be related. This indicates a good degree of consistency of the dependency 
structure identified by the CGGMs. Since lADLl is also highly connected 
in the latent space, Figure 3 suggests that lADLl is key to a principled 
assessment of the disability level of a person. 

The CGGMs clearly show that the 16 disability measures recorded in the 
NLTCS data should not be treated on an equal footing. Some measures such 
as lADLl or lADLlO indicate more serious disabilities than others, which is 
not necessarily reflected in the number of people reporting that particular 
disability. Simply counting the number of disabilities a person has can be 
very misleading when evaluating the overall disability level of an individual. 
This remark could shed a new light on the findings reported in Manton and 
Gu (2001) who only make the distinction between ADLs and lADLs. 

6. Discussion. The inference approach we presented in this paper ex- 
tends Gaussian graphical models to data sets in which the multivariate nor- 
mal assumption for the observed variables is unlikely to hold. The CGGMs 
capture conditional independence relationships among a set of latent vari- 
ables that are in a one-to-one relationship with the set of observed variables. 
The fact that the number of latent variables coincides with the number of 
observed variables avoids the difficult statistical issue of having to select the 
number of latent classes — see the excellent discussions in Erosheva, Fienberg 
and Joutard (2007) and Fienberg et al. (2010). 

Our goal was to model dependencies separately from the univariate margi- 
nal distribution of each variable. As such, we did not include a parametric 
representation of the marginal distributions in our framework. Pitt, Chan 
and Kohn (2006) give a Bayesian approach to model conditional indepen- 
dence relationships in Gaussian copulas in which the univariate marginal 
distributions are allowed to depend on a set of parameters and on certain 
sets of explanatory variables. There is a definite possibility to combine our 
prior specification for the precision matrix for the latent variables with the 
methods of Pitt, Chan and Kohn (2006) into a procedure that takes into 
account the uncertainty in the specification of the univariate distributions. 

The CGGMs are applicable to any observational study for the purpose of 
identifying conditional independence relationships. The only requirement is 
that the observed variables are binary, ordinal or continuous. The extended 
rank likelihood [Hoff (2007)] is a key component of our framework. A nec- 
essary condition for its correct application is that there exists an ordering 
of the possible values of any observed variable — see Section 3. Our frame- 
work does not allow the presence of discrete variables that are not binary 
or ordinal. 

Although the interactions among the latent variables do not go beyond 
second-order moments, CGGMs give sensible results in the analysis of sparse 
contingency tables because they allow inference through Bayesian model 
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averaging. By contrast, log-linear models contain higher-order interaction 
terms but model averaging is no longer an option: the same interaction term 
has a different interpretation in various log-linear models. As such, one has 
to choose one log-linear model and perform inference given this single model. 
When the sample size is small with respect to the total number of possible 
models, such a determination might not be appropriate. The data might not 
contain enough information to distinguish between log-linear models that are 
very close to each other and have almost the same posterior probability — 
see, for example, the analysis of the Rochdale data from Dobra and Massam 
(2010). Our use of CGGMs does not involve choosing one particular model, 
but averaging with respect to many models on the latent space. We hope that 
CGGMs will play a significant role in many quantitative fields of research. 
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SUPPLEMENTARY MATERIAL 

Supplement: C-I--1- implementation of copula Gaussian graphical models 

(DOL 10. 1214/10- AOAS397SUPP; .zip). We provide source code for the 
methodology described in this paper. Our program takes advantage of cluster 
computing to run several Markov chains in parallel. By using this code, one 
can replicate the analyses of the Rochdale data and the NLTCS functional 
disability data for which we give sample input files. 
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